## [1] 4898 13
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
This report explores a data set containing quality ratings and attributes for approximately 4,900 white wines. The dataset contains 13 variables. 11 variables are of type “num” and two variables are of type “int”.
Quality Description: Output score based on sensory data.
Quality Measure: Score between 0 and 10.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
The quality scores have a roughly bell-shaped distribution centered on 6. There are few observations for wines of low and high quality (values 3, 4, 8, and 9). The majority of the observations fall in the middle, between 5 and 7. This tells me that most wines are of average quality; it’s not likely to find a wine of terrible or great quality.
In order to observe reliable trends in the data I want to make sure there are a significant amount of observations in each quality bucket. Hence, I created a new categorical variable named “quality.rating” which contains the following rating groupings: “Bad”, “Average”, and “Good”. Bad: quality scores 3 - 5; Average: quality score 6; Good: quality scores 7 - 9.
Now each quality rating bucket has a significant number of observations as you can see by the bar graph and table print outs below.
##
## Bad Average Good
## 1640 2198 1060
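The report does not show the code that built quality.rating. A minimal sketch of one way to do it, assuming base R's cut() and rebuilding the scores from the frequency table printed earlier (the vector name quality is only for illustration):

```r
# Rebuild the quality scores from the frequency table printed earlier.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# cut() uses right-closed intervals by default: (2,5], (5,6], (6,9].
# That maps scores 3-5 to "Bad", 6 to "Average", and 7-9 to "Good".
quality.rating <- cut(
  quality,
  breaks = c(2, 5, 6, 9),
  labels = c("Bad", "Average", "Good"),
  ordered_result = TRUE
)

# Count the wines in each bucket.
rating_counts <- table(quality.rating)
print(rating_counts)
```

Setting ordered_result = TRUE keeps the Bad < Average < Good ordering, which matters later when the rating is converted to a number.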
Fixed Acidity Description: Most acids involved with wine are fixed or nonvolatile (do not evaporate readily).
Fixed Acidity Measure: tartaric acid - g / dm^3.
Per Calwineries - Tartaric Acid is one of the strongest acids in wine and controls the acidity of a wine. The total acidity in a wine is measured by the amount of Tartaric Acid present. Tartaric Acid plays a critical role in the taste, feel and color of a wine. But even more important, it lowers the pH enough to kill undesirable bacteria, acting as a preservative. The influence of tartaric acid on the taste and feel of a wine is primarily through its impact on acidity. It contributes to the “tartness” of a wine, but not as much as malic and citric acid. Winemakers will adjust acidity by adding tartaric acid to the wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Operations:
I limited the x axis to get a better view of the data distribution.
Observations:
The data is normally distributed. The majority of the data falls between the values of 5 and 9. There are a couple major outliers to the far right as high as 14.2.
Volatile Acidity Description: The amount of acetic acid in wine, which at levels that are too high can lead to an unpleasant, vinegar taste.
Volatile Acidity Measure: acetic acid - g / dm^3
Per Calwineries - The normal level of acetic acid in wine is around 300 mg/liter. Around this level, acetic acid is very desirable, contributing to the wine’s smell and taste. As it increases above this critical number, it gradually gives the wine a sour taste, and the perception of vinegar becomes more apparent.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Operations:
I limited the x axis to get a better view of the data distribution and transformed the x-axis using the square root function.
Observation:
The data has a normal distribution after transforming the x-axis with the square root function. The majority of the data falls between the range of 0.1 and 0.5. There are some outliers to the right ranging as high as 1.1.
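To illustrate why the square root helps here, the sketch below applies it to the five-number summary printed above and compares the spread on each side of the median. The helper spread_ratio is a made-up name for this example, not something from the report.

```r
# Five-number summary of volatile.acidity, copied from the output above.
va <- c(min = 0.08, q1 = 0.21, median = 0.26, q3 = 0.32, max = 1.10)

# Ratio of the upper spread to the lower spread around the median.
# A value near 1 means the two sides are roughly symmetric.
spread_ratio <- function(s, lo, hi) {
  unname((s[hi] - s["median"]) / (s["median"] - s[lo]))
}

# Quartile-based ratio, before and after the square root.
raw_quartiles  <- spread_ratio(va, "q1", "q3")
sqrt_quartiles <- spread_ratio(sqrt(va), "q1", "q3")

# Range-based ratio, before and after the square root.
raw_range  <- spread_ratio(va, "min", "max")
sqrt_range <- spread_ratio(sqrt(va), "min", "max")

print(c(raw_quartiles, sqrt_quartiles, raw_range, sqrt_range))
```

Both ratios move closer to 1 after the transformation, which matches the more symmetric histogram, although the maximum still sits well out in the right tail.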
Citric Acid Description: Found in small quantities, citric acid can add ‘freshness’ and flavor to wines.
Citric Acid Measure: g / dm^3
Per Calwineries - Citric acid plays a major role in a winemaker’s influence on acidity. Many winemakers use citric acid to acidify wines that are too basic and as a flavor additive. This process has its benefits and drawbacks. Adding citric acid will give the wine “freshness” otherwise not present and will effectively make a wine more acidic. The major disadvantage of adding citric acid is its microbial instability. As mentioned earlier, bacteria use citric acid in their metabolism, thus the citric acid added may just be consumed by bacteria, promoting the growth of unwanted microbes.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Operations:
I limited the x-axis to get a better view of the bulk of the data. I also used facet wrap of quality.rating to see if the spikes in data around 0.49 and 0.74 had any relation to the quality of the wine.
Observations:
The data is normally distributed. The majority of the data falls between the values of 0.15 and 0.55. There are a couple major outliers to the far right with values of 1.23 and 1.66 respectively. There are unusual spikes in wine count at the values 0.49 and 0.74. I used a facet wrap to see if the spikes had any relation to the quality of the wine, but there didn’t appear to be any correlation.
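A numeric complement to the facet wrap is to compute, for each rating group, the share of wines sitting exactly on a spike value. The sketch below uses a small toy data frame with hypothetical values (not taken from the wine data), and spike_share is a made-up helper name.

```r
# Toy data (hypothetical values, not taken from the wine data set).
toy <- data.frame(
  citric.acid = c(0.49, 0.30, 0.27, 0.32, 0.49, 0.28, 0.74, 0.31, 0.49),
  quality.rating = factor(
    c("Bad", "Bad", "Bad", "Average", "Average", "Average",
      "Good", "Good", "Good"),
    levels = c("Bad", "Average", "Good"), ordered = TRUE
  )
)

# Share of wines in each rating group that sit on a given spike value.
# A small tolerance avoids floating-point equality problems.
spike_share <- function(x, group, value) {
  tapply(abs(x - value) < 1e-8, group, mean)
}

share_049 <- spike_share(toy$citric.acid, toy$quality.rating, 0.49)
print(round(share_049, 2))
```

If the shares are similar across the three groups, as in this toy example, the spike is unlikely to be related to the quality rating.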
Residual Sugar Description: The amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet.
Residual Sugar Measure: g / dm^3
Per Calwineries - Sugar in wine plays a major role in its sensory characteristics. It can be as obvious as the difference between a sweet and dry wine; or as subtle as the difference between sugar interactions with different tannins. In dry wine, yeasts consume almost all of the sugar from the grapes. In sweet wine, the yeasts are killed before all the sugar is used, leaving residual sugars.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Operations:
I transformed the x-axis using the log10 function to get a better view of the bulk of the data smashed on the left side of the plot. I limited the x-axis to 20 in order to remove the major outliers to the right and focus on the bulk of the data. I also used a facet wrap to see if the peaks in the data had any relation to the quality of the wine.
Observations:
The data is not normally distributed, but rather has a multimodal distribution. The bulk of the data was smashed on the left side of the plot, but there was a significant amount of data spread out along the x-axis from 3 - 20 with several noticeable peaks. I used a facet wrap to see if the peaks in the data had any relation to the quality of the wine, but there didn’t seem to be much correlation.
Chlorides Description: The amount of salt in the wine.
Chlorides Measure: sodium chloride - g / dm^3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Operations:
I limited the x-axis to get a better view of the distribution for the bulk of the data.
Observations:
The data is not normally distributed. The bulk of the data appears to be bimodal with peaks around 0.0375 and 0.046. The bulk of the data falls between 0.02 and 0.065. There are outliers as far out as 0.346.
Free Sulfur Dioxide Description: The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
Free Sulfur Dioxide Measure: mg / dm^3
Per Calwineries - A small amount of SO2 is produced naturally as a by-product of fermentation, but most of the SO2 has been added by the winemaker. During white wine production, it is added at almost every stage of the process, and is more or less required after malolactic fermentation is complete. The most important mechanism of action for Sulfur Dioxide is as an anti-microbial agent. It regulates the growth of harmful yeast and bacteria in the wine. Another important role of Sulfur Dioxide lies in its anti-oxidant properties. This guards against browning and protects the fruit-like qualities of the wine. If a winemaker uses too much SO2, it can kill the “good” yeast, halting fermentation before the desired end point. It can also stop malolactic fermentation from completing, yielding wines that taste unfinished. You can tell a wine that has too much Sulfur Dioxide by its characteristically pungent odor. It smells similar to that of a recently struck match.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Operations:
I limited the x-axis to get a better view of the bulk of the data.
Observations:
The data has a roughly normal distribution with a long tail to the right. The bulk of the data falls between 5 and 70. There are many outliers to the right, ranging as far out as 289.
Total Sulfur Dioxide Description: Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
Total Sulfur Dioxide Measure: mg / dm^3
Per Calwineries - A small amount of SO2 is produced naturally as a by-product of fermentation, but most of the SO2 has been added by the winemaker. During white wine production, it is added at almost every stage of the process, and is more or less required after malolactic fermentation is complete. The most important mechanism of action for Sulfur Dioxide is as an anti-microbial agent. It regulates the growth of harmful yeast and bacteria in the wine. Another important role of Sulfur Dioxide lies in its anti-oxidant properties. This guards against browning and protects the fruit-like qualities of the wine. If a winemaker uses too much SO2, it can kill the “good” yeast, halting fermentation before the desired end point. It can also stop malolactic fermentation from completing, yielding wines that taste unfinished. You can tell a wine that has too much Sulfur Dioxide by its characteristically pungent odor. It smells similar to that of a recently struck match.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Operations:
I limited the x-axis to zoom in on the bulk of the data.
Observations:
The data is normally distributed for the most part, with a long tail to the right. The bulk of the data falls between 50 and 250. There are some outliers ranging as far out as 440.
Density Description: The density of wine is close to that of water depending on the percent alcohol and sugar content.
Density Measure: g / cm^3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Operations:
I limited the x-axis to get a better view of the bulk of the data. I used a facet wrap to see if the distribution had any relation to the quality of the wine.
Observations:
The data is not normally distributed. It appears to be multimodal with 3 or 4 noticeable peaks. I used a facet wrap to see if the distribution had any relation to the quality of the wine and I immediately noticed that the higher quality wines mostly had a density lower than 0.995, while 0.995 lined up closer to the middle of the distributions for the “Bad” and “Average” quality wines. This suggests that Density is related to the wine quality.
pH Description: Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
Per Calwineries - “pH” is a term that quantifies the acidity of a solution. Another important aspect of pH in wine is that it represents the active acid in wine, or the acid that contributes to the fixed acidity.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
Operations:
None
Observations:
The data is normally distributed. The bulk of the data is between 2.9 and 3.5.
Sulphates Description: A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
Sulphates Measure: potassium sulphate - g / dm^3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
[Plot: Sulphates (log10) Histogram]
Operations:
I transformed the x-axis using the log10 scale in order to remove the long tail to the right.
Observations:
The data appears to be normally distributed after transforming the scale using the log10 scale. The bulk of the data falls between 0.3 and 0.7. There are some noticeable spikes in the wine count at certain values like 0.38 and 0.5.
Alcohol Description: The percent alcohol content of the wine.
Alcohol Measure: % by volume
Per Calwineries - The presence of alcohol with sugars, phenols and tannins define the balance of wine. When there is too much alcohol relative to other components, this equation is out of balance, and the wine is considered “hot.” It is generally accepted with fortified wines, but is not desired otherwise.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
Operations:
I used a facet wrap to see if the spikes had any relation to the quality of the wine.
Observations:
The data appears to be multimodal. The bulk of the data falls between 8.5 and 13.5. The highest peak is on the left side of the plot around 9.5, with additional peaks around 10.5 and 12.5. I used a facet wrap to see if the spikes had any relation to the quality of the wine. The peaks appeared to align closely with the different qualities of wine. The far left peak aligned closely with the “Bad” quality wine, the second peak aligned closely with the “Average” quality wine, and the far right peak seemed to line up with the “Good” quality wine.
This report explores a data set containing quality ratings and attributes for approximately 4,900 white wines. The dataset contains 13 variables, with 4,898 observations. 11 variables are of type “num”, two variables are of type “int”. I also created a new variable of type “Ordered Factor with 3 levels”.
The main feature of interest is the Quality Rating.
I expect the variables listed below to have an impact on the quality rating of the wine.
1) Volatile Acidity since high amounts can make the wine taste vinegary.
2) Total Sulfur Dioxide (SO2) since too much Sulfur Dioxide gives the wine a pungent odor - similar to that of a recently struck match.
3) Residual Sugar / Alcohol since the presence of alcohol with sugars, phenols and tannins define the balance of wine. Also, I can tell from the Alcohol Histogram facet wrapped by Quality Rating that alcohol is correlated to the quality of the wine.
4) Density since it is dependent on the percent alcohol and sugar content. Also, I can tell from the Density Histogram facet wrapped by Quality Rating that Density is correlated to the quality of the wine.
Yes, I created an ordered factor variable called quality.rating. I created this variable since some of the quality scores (3, 4, 8, 9) had very few observations. In order to make sure every rating bucket had enough observations, I combined the ratings with the lower counts together. There are now three category groupings: “Bad”, “Average”, and “Good”. Bad: quality scores 3 - 5; Average: quality score 6; Good: quality scores 7 - 9. Now each quality rating bucket has over 1,000 observations.
Yes, I found the following distributions unusual:
1) Residual Sugar had a very interesting distribution. The data had a multimodal distribution. The bulk of the data was smashed on the left side of the plot, but there was a significant amount of data spread out along the x-axis from 3 - 20 with several noticeable peaks. I used a facet wrap to see if the peaks in the data had any relation to the quality of the wine, but there didn’t seem to be much correlation.
2) Alcohol had an interesting distribution. The distribution is multimodal with the highest peak on the left side of the plot around 9.5, with additional peaks around 10.5 and 12.5. I used a facet wrap to see if the spikes had any relation to the quality of the wine. The peaks appeared to align closely with the different qualities of wine. The far left peak aligned closely with the “Bad” quality wine, the second peak aligned closely with the “Average” quality wine, and the far right peak seemed to line up with the “Good” quality wine.
3) Citric Acid had spikes at 0.49 and 0.74 which I found unusual. I used a facet wrap to see if the spikes had any relation to the quality of the wine, but there didn’t appear to be any correlation.
I transformed the x-axis for the histograms for the following attributes.
1) Volatile Acidity: I transformed the x-axis using the square root function to remove the long right tail. The distribution appeared normal after the transformation.
2) Sulphates: I transformed the x-axis using the log10 function to remove the long right tail. The distribution appeared normal after the transformation.
3) Residual Sugar: I transformed the x-axis using the log10 function to get a better view of the bulk of the data smashed on the left side of the plot.
4) I used a Quality Rating facet wrap on the Histograms for Citric Acid, Residual Sugar, Density, and Alcohol to get a better idea of the relationship they had to the quality of the wine, since they each had interesting distributions.
Quality Rating is the main feature of interest, so I first wanted to get a look at the relationship it has with the other features in the dataset.
Here are a few visual observations from the scatterplots:
1) There appears to be no noticeable correlation to quality for Fixed Acidity, Citric Acid, Residual Sugar, Free Sulfur Dioxide, pH, and Sulphates.
2) There is a negative correlation to Quality Rating for Volatile Acidity, Chlorides, Total Sulfur Dioxide, Density.
3) There is a positive correlation between Quality Rating and Alcohol.
Next, I want to run the GG Pairs matrix plot to see how the correlation coefficients for the quality relationships compare to my observations from the scatterplots. Also, the matrix plot will give me a view of the relationships between the other variables. Note: In order to run the correlation test for quality.rating I had to change the quality rating from text to a numeric value. I created a new variable named “quality.rating_num” and assigned the rating values as such: 1 - Bad, 2 - Average, 3 - Good.
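The conversion described above can be sketched as follows, using the level names that appear in the printed tables. The toy values are hypothetical; only the variable name quality.rating_num comes from the report.

```r
# Toy data (hypothetical values) with the three rating levels in order.
toy <- data.frame(
  alcohol = c(9.0, 9.5, 10.4, 10.6, 11.8, 12.5),
  quality.rating = factor(
    c("Bad", "Bad", "Average", "Average", "Good", "Good"),
    levels = c("Bad", "Average", "Good"), ordered = TRUE
  )
)

# as.integer() on a factor returns the position of each level,
# so Bad becomes 1, Average becomes 2, and Good becomes 3.
toy$quality.rating_num <- as.integer(toy$quality.rating)

# cor() computes the Pearson coefficient by default.
r_pearson <- cor(toy$alcohol, toy$quality.rating_num)

# Spearman's rank correlation is a common alternative for ordinal scores.
# It is shown only for comparison and is not what the report describes.
r_spearman <- cor(toy$alcohol, toy$quality.rating_num, method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```

This only works as intended when the factor levels are declared in the right order, because as.integer() follows the level order rather than the label text.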
Observations from the plot:
1) Based on the correlation coefficients produced, only two of the five features I identified from the scatterplots as having a relationship to Quality Rating actually have a meaningful correlation with Quality Rating.
>> Alcohol (0.463)
>> Density (-0.332)
2) There are two other relationships with high correlation (absolute value > 0.7).
>> Residual Sugar and Density (0.839)
>> Density and Alcohol (-0.780)
3) There are two relationships with moderate correlation (absolute value between 0.5 and 0.7).
>> Free Sulfur Dioxide and Total Sulfur Dioxide (0.616)
>> Total Sulfur Dioxide and Density (0.530)
4) There are five relationships with low correlation (absolute value between 0.3 and 0.5).
>> Residual Sugar and Alcohol (-0.451)
>> Total Sulfur Dioxide and Alcohol (-0.449)
>> Fixed Acidity and pH (-0.426)
>> Residual sugar and Total Sulfur Dioxide (0.401)
>> Chlorides and Alcohol (-0.360)
5) There were other variables that appeared to be correlated based on the scatterplot, but the correlation coefficients didn’t back it up.
>> Citric Acid and Fixed Acidity (0.289)
>> Citric Acid and Volatile Acidity (-0.149)
>> Chlorides and Total Sulfur Dioxide (0.199)
>> pH and Citric Acid (-0.164)
>> Fixed Acidity and Density (0.265)
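The grouping used in the lists above can be reproduced by binning the absolute value of each coefficient. This is a small sketch with the coefficients copied from the lists; the band label "negligible" for values below 0.3 is my own naming.

```r
# Correlation coefficients copied from the lists above.
r <- c(
  "Residual Sugar ~ Density"              =  0.839,
  "Density ~ Alcohol"                     = -0.780,
  "Free SO2 ~ Total SO2"                  =  0.616,
  "Total SO2 ~ Density"                   =  0.530,
  "Residual Sugar ~ Alcohol"              = -0.451,
  "Total SO2 ~ Alcohol"                   = -0.449,
  "Fixed Acidity ~ pH"                    = -0.426,
  "Residual Sugar ~ Total SO2"            =  0.401,
  "Chlorides ~ Alcohol"                   = -0.360,
  "Citric Acid ~ Fixed Acidity"           =  0.289,
  "Citric Acid ~ Volatile Acidity"        = -0.149,
  "Chlorides ~ Total SO2"                 =  0.199,
  "pH ~ Citric Acid"                      = -0.164,
  "Fixed Acidity ~ Density"               =  0.265
)

# Bin by absolute value so negative coefficients land in the right band.
strength <- cut(
  abs(r),
  breaks = c(0, 0.3, 0.5, 0.7, 1),
  labels = c("negligible", "low", "moderate", "high")
)

strength_counts <- table(strength)
print(strength_counts)
```

Binning on abs(r) is what lets Density and Alcohol (-0.780) count as a high correlation even though the coefficient itself is below 0.7.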
## [1] "Wine Alcohol median by Quality Rating:"
## wine$quality.rating: Bad
## [1] 9.6
## --------------------------------------------------------
## wine$quality.rating: Average
## [1] 10.5
## --------------------------------------------------------
## wine$quality.rating: Good
## [1] 11.5
Observations:
There is a positive correlation. Wine quality increases as the Alcohol content increases. This caught me by surprise. I didn’t expect alcohol content to have a significant correlation with the wine quality whether it be positive or negative. If anything, I thought the wines with high alcohol content would be rated lower.
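The layout of the printed medians matches what base R's by() produces. A minimal sketch of that kind of call follows; the toy alcohol values are hypothetical and were chosen so the group medians equal the ones printed above.

```r
# Toy data (hypothetical values, three wines per rating group).
toy <- data.frame(
  alcohol = c(9.2, 9.6, 10.1, 10.2, 10.5, 11.0, 11.1, 11.5, 12.3),
  quality.rating = factor(
    rep(c("Bad", "Average", "Good"), each = 3),
    levels = c("Bad", "Average", "Good"), ordered = TRUE
  )
)

# by() prints one block per group, separated by dashed lines.
print(by(toy$alcohol, toy$quality.rating, median))

# tapply() returns the same medians as a plain named vector,
# which is easier to reuse in later calculations.
alcohol_medians <- tapply(toy$alcohol, toy$quality.rating, median)
print(alcohol_medians)
```

Comparing medians rather than means keeps the few very high alcohol values from dominating the comparison between groups.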
## [1] "Wine Density median by Quality Rating:"
## wine$quality.rating: Bad
## [1] 0.99514
## --------------------------------------------------------
## wine$quality.rating: Average
## [1] 0.99366
## --------------------------------------------------------
## wine$quality.rating: Good
## [1] 0.99173
Observations:
There is a negative correlation. Wine quality increases as the density of the wine decreases. I expected to see a negative correlation, since the density of the wine depends on the alcohol and sugar content. The higher the alcohol content, the lower the density will be. Alcohol is positively correlated with wine quality (0.463), so I expected density to have a negative correlation.
Operations:
I limited the x-axis and y-axis to get a better view of the data.
Observations:
There is a negative correlation. Alcohol content decreases as the density increases. I expected to see this as there is a natural correlation between the two variables. Alcohol is less dense than water, so naturally as the alcohol content decreases, the density will increase.
Operations:
I limited the x-axis and y-axis to get a better view of the data.
Observations:
There is a positive correlation. Wine density increases as the residual sugar level increases. I expected to see this relationship since there is a natural correlation between the two variables. Higher residual sugar means that there is less alcohol. Less alcohol means that the density is higher since alcohol is less dense than water. Therefore, you would expect the density to be greater as the residual sugar increases.
Operations:
I limited the x-axis and y-axis to get a better view of the data.
Observations:
There is a positive correlation. Wine density increases as the total sulfur dioxide level increases.
Operations:
I limited the x-axis to get a better view of the data.
Observations:
There is a negative correlation. Wine alcohol content decreases as the amount of salt in the wine increases.
Operations:
I limited the x-axis and y-axis to get a better view of the data.
Observations:
There is a negative correlation. Wine alcohol content decreases as the total sulfur dioxide increases.
Operations:
I limited the x-axis and y-axis to get a better view of the data.
Observations:
There is a negative correlation. Wine alcohol content decreases as the residual sugar level increases. This is an expected result since there is a natural correlation between the two variables. Sugar is turned into alcohol during the fermentation process, so we would expect the alcohol content to decrease as the residual sugar level increases. Note: The majority of the wines have very little residual sugar.
Operations:
I limited the x-axis and y-axis to get a better view of the data.
Observations:
There is a positive correlation. Total sulfur dioxide increases as the free sulfur dioxide increases. This is a natural correlation, since if the free Sulfur Dioxide increases that would increase the total Sulfur Dioxide level.
Operations:
I limited the x-axis and y-axis to get a better view of the data.
Observations:
There is a positive correlation. Total sulfur dioxide increases as the residual sugar level increases. The majority of the wines have very low residual sugar counts and the total Sulfur Dioxide for these wines varies greatly.
Operations:
I limited the x-axis and y-axis to get a better view of the data.
Observations:
There is a negative correlation. pH level decreases as the fixed acidity increases. This is probably a natural correlation we’re seeing as low pH values mean there is high acidity and high pH values mean low acidity. Therefore, as fixed acidity increases you would expect the pH value to decrease.
Quality Rating isn’t strongly correlated to any of the other features in the dataset. Only Alcohol and Density share a meaningful relationship with Quality Rating.
Alcohol has a positive correlation (0.463) with Quality Rating, meaning the wine quality increases as the Alcohol content increases. I would have expected Residual Sugar to have a stronger negative correlation with Quality Rating than it does (-0.126), since Residual Sugar shares a natural correlation with Alcohol.
Density has a negative correlation (-0.332) with Quality Rating, meaning wine quality increases as the density of the wine decreases. I expected to see a negative correlation, since the density of the wine depends on the alcohol and sugar content. The higher the alcohol content, the lower the density will be.
Density, Residual Sugar, and Alcohol have an interesting relationship. Sugar is turned into alcohol during the fermentation process and alcohol is less dense than water. So, I would expect to see residual sugar and density decrease as the alcohol content increases. I would also expect to see the density increase as the residual sugar increases.
I also found it interesting to see that Total Sulfur Dioxide was correlated to Density, Residual Sugar and Alcohol. I wanted to figure out what was driving the relationships. Per Calwineries website - “If a winemaker uses too much SO2 (Total Sulfur Dioxide), it can kill the “good” yeast, halting fermentation before the desired end point.” My takeaway: if there is too much Total Sulfur Dioxide, the fermentation may stop early, which means there will be more residual sugar, less alcohol and thus higher density.
These relationships are reflected in the correlation coefficients:
>> Alcohol and Residual Sugar (-0.451)
>> Alcohol and Density (-0.780)
>> Residual Sugar and Density (0.839)
>> Total Sulfur Dioxide and Residual sugar (0.401)
>> Total Sulfur Dioxide and Alcohol (-0.449)
>> Total Sulfur Dioxide and Density (0.530)
Residual Sugar and Density was the strongest relationship I found. The correlation score was 0.8389665. There is a positive correlation between the two. As residual sugar increases, the density increases. This is a natural correlation since the higher the residual sugar, the less alcohol there will be. The less alcohol there is, the higher the density since alcohol is less dense than water. Dissolved sugar also raises the density of a liquid directly, which likely adds to the effect.
Observations:
Better quality wines tend to have higher alcohol content than lesser quality wines, but lower density, residual sugar, and total sulfur dioxide.
Observations:
Better quality wines tend to have less total sulfur dioxide. There doesn’t appear to be much of a relationship between quality rating and the amount of free sulfur dioxide in the wine.
Observations:
Better quality wines tend to have higher alcohol and less salt. Alcohol appears to be a bigger factor. This is noticeable in the plot by how the dot color transitions from mostly blue, to green, to red as you move from the top of the plot to the bottom.
Observations:
There’s no apparent relationship between quality rating and either Fixed Acidity or pH.
Better quality wines tend to have higher alcohol content, lower density, lower residual sugar, lower total sulfur dioxide, and lower salt. Lower quality wines tend to have lower alcohol content, higher density, higher residual sugar, higher total sulfur dioxide, and higher salt.
Alcohol and Density have a strong relationship and they both share a meaningful relationship with the wine quality rating. This is very apparent in the Alcohol and Density by Quality Rating scatterplots. There is clearly a negative linear relationship between alcohol and density. When you overlay the quality rating you can see a clear transition in color from mostly blue in the top-left to mostly red in the bottom-right. This tells us that higher quality wines tend to have a higher alcohol content and lower density than lower quality wines.
Surprisingly, there doesn’t seem to be much of a relationship between the wine quality and the amount of acid in the wine. Before I started my analysis I thought the acid in the wine would be the main factor in wine quality but, as you can see in the scatterplot for Fixed Acidity and pH, there’s not any apparent relationship to quality rating.
I also found it interesting to see the strong relationship to quality rating represented in the Alcohol and Chlorides scatterplot. You can see from the plot that it is rare to find a “Good” rated wine with a chlorides level higher than 0.05 g / dm^3.
No
I used a histogram to see if the unique distributions for Alcohol had any relation to the quality of the wine. By doing so I was able to identify early on that Alcohol is correlated to the quality of the wine. It’s easy to see that the higher quality wines tend to have a higher alcohol content.
Alcohol and Density have a strong relationship and they both share a meaningful relationship with the wine quality rating. This is very apparent in the Alcohol and Density by Quality Rating scatterplot. There is clearly a negative linear relationship between alcohol and density. When you overlay the quality rating you can see a clear transition in color from mostly blue in the top-left to mostly red in the bottom-right. This tells us that higher quality wines tend to have a higher alcohol content and lower density than lower quality wines. I also chose to remove the “Average” rated wines from the plot to more clearly show the difference between the “Good” and “Bad” wines.
This plot shows the relationship between pH, fixed.acidity, and quality rating. I thought this plot was surprising because it shows that there is no apparent relationship between quality rating and the level of acidity. Going into this exercise, I thought that the acidity would be the main factor.
When I started the project I thought that acid would be the main factor in determining the wine quality rating. However, I was unable to find a meaningful relationship between acidity and wine quality during my analysis. I tried adding, subtracting, and dividing the acid features, but nothing I tried produced significant results. Three of the things I tried were:
1) New variable equal to the sum of fixed.acidity, volatile.acidity, and citric.acid.
2) New variable equal to the quotient of volatile.acidity divided by fixed.acidity.
3) New variable equal to fixed.acidity minus volatile.acidity.
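The three derived variables listed above can be sketched with transform(). The input values are the first four rows shown in the str() output at the top of the report; the new column names (total.acid, volatile.to.fixed, fixed.minus.volatile) are placeholders, since the report does not name them.

```r
# First four rows of the acid columns, copied from the str() output.
acids <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1, 7.2),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23),
  citric.acid      = c(0.36, 0.34, 0.40, 0.32)
)

# Add the three derived features in one step.
acids <- transform(
  acids,
  total.acid           = fixed.acidity + volatile.acidity + citric.acid,
  volatile.to.fixed    = volatile.acidity / fixed.acidity,
  fixed.minus.volatile = fixed.acidity - volatile.acidity
)

print(acids)
```

Each new column could then be correlated with the numeric quality rating in the same way as the original features.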
Another issue I encountered was running the correlation test with the quality.rating variable. The quality.rating variable was an ordered factor variable, so I had to create a new numerical variable to be able to calculate the correlation coefficients.
I was successful early on in identifying the key features that would have an impact on the quality of the wine. I identified the relationship alcohol and density shared with quality by investigating their Histogram data distribution using facet wrap on quality.rating.
I also thought the scatterplots that included quality.rating and the other features were very helpful in identifying the negative/positive relationships or lack thereof each feature shared with wine quality.
For the multivariate scatterplots, I found it to be very helpful to remove the dots for the “Average” rated wines, leaving just the “Bad” and “Good” wines. By removing the “Average” rated wines, the plot became less crowded and made it much easier to see the relationship quality rating shared with the other two variables.
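A minimal sketch of that filtering step, assuming base R's subset() and droplevels() and a toy data frame with hypothetical values (the name extremes is only for illustration):

```r
# Toy data (hypothetical values, two wines per rating group).
toy <- data.frame(
  alcohol = c(9.1, 10.4, 12.2, 9.4, 10.6, 12.6),
  density = c(0.9970, 0.9940, 0.9910, 0.9965, 0.9935, 0.9905),
  quality.rating = factor(
    c("Bad", "Average", "Good", "Bad", "Average", "Good"),
    levels = c("Bad", "Average", "Good"), ordered = TRUE
  )
)

# Keep only the "Bad" and "Good" wines.
# droplevels() removes the now-empty "Average" level so that it
# does not show up as an unused entry in tables or plot legends.
extremes <- droplevels(subset(toy, quality.rating != "Average"))

print(table(extremes$quality.rating))
```

The filtered data frame can be passed to the plotting code in place of the full data set, leaving only two colors to compare.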
A couple ways to enrich the analysis in the future are:
1) Add the red wines dataset into the mix also. You could perform similar analysis for the red wines. Then compare the difference in the way the features impact the white wines vs. the red wines. I’ve always enjoyed the taste of red wine much more than white wine, so I would be interested in seeing the results.
2) You could build a predictive model to predict the quality rating of the wine based on input values for the features.